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Overdispersed  Generalized  Linear  Models 


Dipak  K.  Dey,  Alan  £.  Gclfand  and  Fengchun  Peng 


ABSTRACT 

Generalized  linear  models  have  become  a  standard  class  of  models  for  data  analysts.  Howev¬ 
er  in  some  applications,  heterogeneity  in  samples  is  too  great  to  be  explained  by  the  simple 
variance  function  implicit  in  such  models.  Utilizing  a  two  parameter  exponential  family 
which  is  overdispersed  relative  to  a  specified  one  parameter  exponential  family  enables  the 
creation  of  classes  of  overdispersed  generalized  linear  models  (OGLM’s)  which  are  analyt¬ 
ically  attractive.  We  propose  fitting  such  models  within  a  Bayesian  framework  employing 
noninformative  priors  in  order  to  let  the  data  drive  the  inference.  Hence  our  analysis  approx¬ 
imates  likelihood-based  inference  but  with  possibly  more  reliable  estimates  of  variability  for 
small  sample  sizes.  Bayesian  calculations  are  carried  out  using  a  Metropolis- within- Gibbs 
sampling  algorithm.  An  illustrative  example  using  a  data  set  involving  damage  incidents  to 
cargo  ships  is  presented.  Details  of  the  data  analysis  are  provided  including  comparison  with 
the  standard  generalized  linear  models  analysis.  Several  diagnostic  tools  reveal  the  improved 
performance  of  the  OGLM. 


KEY  WORDS:  Exponential  families;  Exponential  dispersion  models;  Jefireys’s  prior;  Mix¬ 
ture  models;  Metropolis-within-Gibbs  algorithm;  Orthogonal  parameters. _ _ _ 
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1  INTRODUCTION 


Generalized  linear  models  (GLM)  have  by  now  become  a  standard  class  of  models  in  the 
data  analyst’s  tool  box.  The  evolution  of  these  models  as  well  as  details  regarding  inference, 
fitting,  model  checking,  etc,  is  documented  in  the  book  by  McCullagh  and  Nelder  (1989). 
The  GLIM  software  package  is  widely  available  and  utilized. 

Recently,  such  models  have  been  criticized  in  certain  applications  as  being  too  restrictive 
due  to  the  fact  that  the  variance  is  a  specified  function  of  the  mean.  In  practice  samples  are 
often  found  to  be  too  heterogeneous  to  be  explained  by  such  a  simple  functional  relationship; 
variability  tends  to  be  larger  than  that  captured  through  this  function.  A  natural  remedy  is 
to  consider  a  larger  class  of  models. 

Historically,  the  most  frequently  used  approach  for  creating  a  larger  class  has  been 
through  mixture  models.  For  instance  the  one  parameter  exponential  family  defining  the 
GLM  is  mixed  with  a  two  parameter  exponential  family  for  the  canonical  parameter  9 
(equivalently  the  mean  parameter  p)  resulting  in  a  two  parameter  marginal  mixture  family 
for  the  data.  Shaked  (1980)  showed  that  such  mi-ring  necessarily  inflates  the  model  variance. 
Such  overdispersion  is  of  a  certain  type.  Since  the  likelihood  depends  upon  sample  size  while 
the  mixture  distribution  does  not,  the  relative  overdispersion  of  the  resulting  mixture  family 
to  the  original  exponential  family  tends  to  infinity  as  sample  size  does.  In  other  words,  tak¬ 
ing  additional  observations  within  a  population  does  not  increase  our  knowledge  regarding 
heterogeneity  across  populations.  (See  Gelfand  and  Dalai,  1990  in  this  regard.)  We  also  note 
that  the  resulting  overdispersed  family  of  mixture  models  will  generally  be  awkward  to  work 
with  since  it  will  no  longer  be  an  exponential  family  (e.g.  Beta-binomial,  Poisson-gamma). 

Efron  (1986)  presents  an  alternative  approach  through  so-called  double  exponential  fami¬ 
lies.  Such  families  are  derived  as  the  saddle  point  approximation  to  the  density  of  an  average 
of  n*  random  variables  from  a  one  parameter  exponential  family  for  large  n*  .  The  param¬ 
eter  n*;  written  suggestively  by  Efron  as  np,  0  <  p  <  1  for  actual  sample  size  n,  introduces 
p  as  a  second  parameter  in  the  model  along  with  the  canonical  parameter  6.  Customary 
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inclusion  of  a  dispersion  parameter,  say  <f>,  to  the  usual  one  parameter  exponential  family 
results  in  an  exponential  dispersion  model,  EDM  (Jorgensen,  1987).  Because  <j>  enters  as 
a  sample  size  or  shape  parameter,  associated  inference  is  usually  handled  differently  from 
that  for  Q  oi  fi.  A  one  parameter  exponential  family  in  6  arises  for  each  given  but,  as  a 
two  parameter  model  in  (0,  <f>),  we  no  longer  have  an  exponential  family  .  Recently,  Ganio 
and  Schafer  (1992)  circumvent  this  problem  in  an  approximate  fashion  by  viewing  the  EDM 
as  embedded  within  Efron’s  double  exponential  family  .  The  asymptotics  associated  with 
Efron’s  family  reveal  o^er  dispersion  relative  to  the  original  exponential  family  which  tends 
to  a  constant  asn— »oo,  unlike  the  mixture  case  (Efron,  1986;  Gelfand  &  Dalai,  1990). 

Gelfand  &  Dalai  (1990)  argue  that  an  appeal  to  asymptotics  is  not  necessary  to  justify 
such  models.  More  generally,  for  a  given  one  parameter  exponential  family  they  introduce 
a  two  parameter  exponential  family  of  models  which  is  overdispersed.  This  family  includes 
Efron’s  model  as  a  special  case  and  also  includes  a  family  discussed  in  Lindsay  (1986). 
Retaining  the  exponential  family  structure  simplifies  inference  (as  we  shall  detail  in  the 
subsequent  sections).  Relative  overdispersion  behaves  as  in  Efron’s  models.  Gelfand  and 
Dalai  suggested  that,  with  the  specification  of  link  functions,  these  parameters  could  each  be 
given  the  usual  GLM  structure  but  take  the  matter  no  further.  Our  objective  here  is  to  fully 
examine  such  models  which  we  refer  to  as  overdispersed  generalized  linear  models(OGLM’s). 
Again  the  selling  points  of  our  approach  include  the  familiarity  of  exponential  families, 
the  ready  interpretation  of  model  parameters,  the  unification  of  modeling  by  absorbing 
earlier  cases  and  exact  inference  rather  than  that  from  possibly  inappropriate  asymptotic 
approximations. 

Another  approach  for  handling  heterogeneity  in  GLM’s  is  through  the  use  of  random  ef¬ 
fects.  See,  e.g.,  Breslow  and  Clayton  (1993)  for  a  discussion  and  review  of  the  literature.  In 
the  simplest  version  a  standard  GLM  form  is  employed,  but  for  each  individual  or  population 
in  the  sample,  a  random  effect  is  added  to  the  fixed  covariate  effect  term  in  the  definition 
of  the  mean  structure  resulting  in  a  generalized  linear  mixed  model(GLMM).  These  random 
effects,  sometimes  called  frailties,  are  introduced  to  “soak  up  variability”  and  hence  clarify 
the  explanation  provided  by  the  covariates.  The  likelihood  retains  all  of  the  random  effects. 
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Often  these  individual  effects  are  of  interest  but  even  if  they  are  not,  marginalization  over 
them  is  generally  intractable.  Such  a  marginalized  density  or  likelihood  is  an  example  of 
the  aforementioned  mixing.  With  the  random  effects,  the  likelihood  is  nonregular  in  that, 
as  sample  size  tends  to  infinity  so  does  the  number  of  model  parameters.  Customary  ap¬ 
proaches  for  model  adequacy  and  model  choice  fail  since  the  usual  asymptotic  distribution 
theory  is  no  longer  valid.  OGLM’s  are  regular  and  parsimonious  compared  to  a  GLMM  but 
carry  parameters  in  addition  to  those  of  the  standard  GLM  which  permit  the  possibility  of 
capturing  overdispersion  within  an  exponential  family  framework. 

We  adopt  a  Bayesian  perspective  in  fitting  these  models  since  we  are  drawn  to  the  unifying 
use  of  inference  summaries  based  upon  the  posterior  implicit  therein.  However  we  assume 
that  primary  concern  lies  with  the  modeling  incorporated  in  the  likelihood  and  thus  adopt  an 
objective  Bayesian  stance  employing  noninformative  prior  specifications.  For  large  sample 
sizes  our  inference  will  be  close  to  that  arising  from  maximum  likelihood;  for  smaller  samples 
our  estimates  of  variability  should  be  more  appropriate  than  asymptotic  ones  associated  with 
likelihood  methods.  Required  Bayesian  computation  is  handled  through  a  Metropolis- within- 
Gibbs  Markov  chain  Monte  Carlo  approach  resulting  in  samples  essentially  from  the  joint 
posterior  distribution  which  may  be  summarized  to  provide  any  desired  inference.  Such 
samples  may  also  be  used  as  the  starting  point  for  sampling  from  predictive  distributions 
to  investigate  questions  of  model  adequacy  and  model  choice.  Bayesian  fitting  of  GLM’s  is 
discussed  in  Dellaportas  and  Smith  (1993). 

The  format  of  this  paper  is  then  the  following.  In  section  2  we  formalize  notation  to 
develop  required  likelihoods.  We  also  review  properties  of  the  overdispersed  exponential 
family  of  models.  In  section  3  we  develop  Fisher's  information  matrix  for  these  models  which 
is  important  for  both  likelihood-based  and  Bayesian  inference.  For  the  Bayesian  approach 
there  is  concern  regarding  the  propriety  of  posteriors  arising  under  Jefffeys’s  prior,  which  is 
obtained  from  Fisher’s  information  matrix,  as  well  as  under  a  flat  prior.  In  an  appendix  we 
address  this  question  extending  the  results  of  Ibrahim  and  Laud  (1991).  Finally  in  section 
4  we  undertake  an  analysis  of  a  data  set  involving  wave  damage  to  cargo  vessels.  We  offer  a 
reasonably  complete  posterior  analysis  of  an  OGLM  fitted  to  the  data  as  well  as  comparison 
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with,  the  likelihood  analysis  of  the  standard  GLM.  We  also  use  several  diagnostic  tools  to 
reveal  the  improved  performance  of  the  OGLM. 


2  OVERDISPERSED  GENERALIZED  LINEAR 
MODELS 

Gelfand  and  Dalai  (1990)  discuss  the  class  of  two-parameter  exponential  family  models 
of  the  form 

f(y  |0,r)  =  b{y)edv+rT^-'<e'r)  (1) 

where  if  y  is  continuous,  f  is  assumed  to  be  a  density  with  respect  to  Lebesgue  measure  while 
if  y  is  discrete,  f  is  assumed  to  be  a  density  with  respect  to  counting  measure.  Assuming 
that  (1)  is  integrable  over  ye  y,  they  show  that,  if  T(y)  is  convex  then,  for  a  common  mean, 
var(y)  increases  in  r. 

It  is  presumed  that  the  natural  parameter  space  contains  a  two  dimensional  rectangle 
which,  by  translation,  can  be  taken  to  contain  r  =  0.  Then  the  associated  one  parameter 
exponential  family  arises  at  r  =  0  and  takes  the  form 

/(» I «)  =  (2) 

with  x(0)  =  p(8,  0).  Thus,  as  r  increases  horn  0,  var(y)  increases  relative  to  that  under  the 
associated  one  parameter  exponential  family,  capturing  the  notion  of  overdispersion. 

Expression  (2)  is  the  customary  one  parameter  exponential  family  from  which  a  GLM  is 
developed.  In  particular  p  =  E(y)  =  x'(&)>  var(y)  ~  x"(0)  =  ^(m)-  Here  x'{0)  is  strictly 
increasing  in  9  so  that  p  and  9  are  one-to-one  ( 9  =  (%  )  1  (/*))•  V[p)  is  called  the  variance 
function.  A  GLM  is  defined  through  a  link  function  g,  a  strictly  increasing  differentiable 
transformation  from  p  to  7jeRl,  i.e.,  g(p)  ~  tj  =  xT/3,  where  x  and  /3  are,  respectively,  a 
known  vector  of  explanatory  variables  and  an  unknown  vector  of  model  parameters.  A  more 
general  version  of  (2)  is 

/(y  I  M)  =  (3) 
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Now  wor{y)  =  (a(^))-1  V(/x).  If  y  is  viewed  as  an  average  of  say  n  random  variables  then  a 
usual  form  for  a(^)  is  (n^)-1.  Setting  —  ruf> ,  <f>*  is  referred  to  as  the  dispersion  parameter 
and  (3)  is  called  an  exponential  dispersion  model(EDM)  (Jorgensen,  1987).  Ganio  and 
Schafer  (1992)  extend  the  GLM  based  on  (2)  to  an  EDM,  where  <j>  =  k(zTa)  with  z  a  known 
vector  and  a  an  unknown  parameter  vector.  They  assume  zT a  includes  an  intercept.  The 
two  parameter  family  in  (3)  differs  from  that  in  (1)  in  the  sense  that  (1)  is  a  customary 
two  parameter  exponential  family  whereas  (3)  is  a  customary  one  parameter  family  for  each 
fixed  4>. 

Efron  (1986)  defined  the  double  exponential  family  of  models  through  the  density 

]{y  |  9,  p,  n)  =  c{9,  p,  n)p ’  e^-xW^-'X'Wv-xWv)))  (4) 

where  9(y)  —  (^)-1(y).  Here  y  is  viewed  as  an  average  of  n  i.i.d.  random  variables,  9  is 
the  ca.nonira.1  parameter  and  p  is  a  dispersion  parameter.  In  regression  problems  he  assumes 
a  GLM  in  9  (canonical  link),  i.e.,  9  =  xT(3  and  that  p  =  h(zTc r)  for  a  suitable  h.  Using 
various  expansions,  Efron  shows  that  (4)  permits  attractive  approximation  as  n  grows  large. 
Most  notably,  /  behaves  like  (3)  with  a(^)  =  (np)-1.  Hence  Ganio  and  Schafer  (1992)  treat 
their  extended  GLM,  based  upon  (3),  as  an  example  of  Efron’s  double  exponential  family 
and  carry  out  their  model  fitting  following  his  examples. 

We  note  that,  regardless  of  n,  (4)  is  of  the  form  (1)  with  T(y)  =  9(y)y  —  x{9(y)), 
t  =  n(l  —  p)  and  9  =  np9.  Straightforward  calculation  shows  this  T(y)  is  convex  so  that,  in 
fact,  (4)  is  a  special  case  of  (1).  As  Gelfand  and  Dalai  (1990)  show,  other  choices  of  T{y) 
may  be  more  appropriate  and  in  any  event  it  seems  preferable  to  work  with  the  exact  form 
(1)  rather  than  with  approximation  to  (4). 

Returning  to  (1),  under  usual  regularity  conditions,  we  have  the  following  properties.  If 
P{r,s)  =  ?(1,0)  =  E(V  I  9,r)  =  p,  =  var(y  |  0,r),  p (°-1)  =  E(T{y)  |  9,r),  etc. 

It  is  sometimes  convenient  to  consider  (1)  through  a  mem  parametrization 

f(y  \p,r)  =  6(y)e(y-,1>^l’0,(^r)+TT(w)+^^T).  (5) 

In  (5),  we  employ  the  same  sort  of  notation  for  as  for  p.  By  comparison  with  (1)  we 
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have  8  =  r )  and  p(9,  r)  =  r )  +  ,  r).  Using  (5)  and  straightforward 

•  calculation,  we  can  show  that  E  )  =  0,  i.e,  that  ft  and  r  are  orthogonal  parameters 

in  the  sense  of  B amdorff- Nielsen  (1978,  pl84)  and  Cox  and  Reid  (1987). 

The  only  practical  drawback  to  working  with  (1)  is  that  p(8,r)  is  not  available  explic¬ 
itly.  While  x(£)  i n  (2)  is  usually  an  explicit  function  of  9,  p(8,r)  =  log  /  b(y)e^v+rT(-v^dy 
usually  requires  a  univariate  numerical  integration  or  summation.  In  the  examples  we  have 
investigated  thus  far  this  has  not  presented  a  problem. 

To  create  an  overdispersed  generalized  linear  model  (OGLM)  from  (1)  suppose  we  have 
independent  responses  y»  with  associated  covariates  Xi,p  x  1  and  9  x  1,  i=l,2,..,n.  The 
components  of  x  and  z  need  not  be  exclusive.  Let  y  =  (yi,  yj,  •  •  • ,  yn)  and  define  9i  =  g(xf(3) 
and  Ti  =  h(zfa)  where  g  and  h  are  strictly  increasing.  The  resulting  likelihood  is, 

L(/3,  cr;  y)  =  ft  etf|«+nT(w)"p(<,i>n).  (6) 

»=i 

The  monotonirity  of  g  and  h  is  natural  and  insures  that  0*  is  monotonic  in  x n  and  that 
n  is  monotonic  in  za  facilitating  interpretation.  Of  course  such  monotonicity  does  not 
imply  that,  e.g.,  0*  is  monotone  in  each  covariate.  The  form  xf/3  allows  a  covariate  to 
enter  as  a  polynomial.  If  an  explanatory  variable  appears  in  X{  but  not  in  z»,  say  xu, 
then  =  p^°\8i,Ti)g'(xf/3)/3i.  But  since  p(2’0)  and  g  are  strictly  positive,  /*,•  is  strictly 
monotone  in  xu  with  the  sign  of  j3i  determining  the  direction.  If  the  variable  appears  in  z+ 
as  well,  its  influence  on  fi%  is  less  clear  since  now  involves  the  covariance  between  y 
and  T(y).  In  any  event,  using  a  Bayesian  framework  with  a  sampling-based  implementation 
enables  straightforward  computation  of  |y),  the  posterior  mean  of  m  at  any  x,-  and  z< 
allowing  us  to  study  its  change  as  covariate  levels  change.  In  the  example  of  section  4  we 
have  taken  g  and  h  to  be  the  identity  functions,  in  the  spirit  of  canonical  links.  If  we  wanted 
to  force  overdispersion  (r<  >  0)  we  could,  for  example,  take  T{  =  exp(zfa). 
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3  INFORMATION  CALCULATIONS  AND  PRIOR 
SPECIFICATIONS  FOR  AN  OGLM 


The  Fisher  information  matrix  associated  with  (6)  is  of  interest  for  both  likelihood-based 
inference  as  well  as  Bayesian  inference.  Evaluated  at  the  MLE,  it  provides  an  estimate  of  the 
large  sample  covariance  structure  for  the  MLE’s  of  f3  and  a.  In  the  Bayesian  setting,  again 
evaluated  at  the  MLE  and  again  when  sample  sizes  are  large,  it  provides  an  approximation 
to  the  posterior  covariance  matrix  for  j3  and  a.  Also,  the  square  root  of  the  determinant  of 
this  matrix,  as  a  function  of  (3  and  at ,  is  known  as  Jeffreys  ]s  prior  and  is  commonly  used  as 
a  “noninformative”  specification.  Ibrahim  and  Laud  (1991)  obtain  this  matrix  in  the  case  of 
a  generalized  linear  model  developed  from  (3),  assuming  <i>  is  known,  and  discuss  its  use  as 
a  prior.  We  extended  their  calculation  to  (6). 


Straightforwardly  we  may  show  that 


E  ( 

g(OT^;°’y))  =  (*!<*))■ 


Let  X denote  the  nxp  design  matrix  arising  from  the  x'-s,  Z  the  nxq  design  matrix  arising 
from  the  z-s,  Mg  an  nxn  diagonal  matrix  with  (Mg)x  —  p^2'°\&i,  rt- )(g  (xf/3))2,  MT  an  nxn 
diagonal  matrix  with  ( Mr)ii  =  P^0'2K^itri)(h>'{,zfa))2  “d  MgyT  an  nxn  diagonal  matrix 
with  (Mg<r)a  =  (xf (3))(h! (zf  a)).  Then 


met) 


XTMeX  XTMe,rZ 
ZTMe,TX  ZtMtZ 


and  Jeffreys ’s  prior  is  ]/(/?,  a) j*. 


(7) 


To  work  with  (7)  requires  calculation  of  p,  p^2'°\  p^0,2^  and  This  in  turn,  requires 
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calculation  of  six  integrals  of  the  form  f  yeTd(y)b(y)e9v+T'r^dy  for  the  set  of  (c,d)  from 
{(0,0)  (1,0)  (2,0)  (0,1)  (0,2)  (1,1)}.  Numerical  integration  or  summation  for  such  univariate 
integrals  is  generally  routine. 

Suppose  instead  we  define  an  extension  of  the  GLM  using  the  mean  parametrization  (5), 
setting  p  =  g(xTf3)  with  again  r  =  h(xTa).  In  the  case  of  (2),  i.e.  r  =  0,  this  is,  in  fact, 
the  more  usual  way  of  formulating  a  GLM.  Paralleling  (7)  let  us  now  define  M ^  as  an  nxn 
diagonal  matrix  with  (jWM)i*  =  i>^'°\pi,Ti)(g'  (xiT  (3))2  and  MT  an  nxn  diagonal  matrix 
with  ( Mr)ii  =  Ti)(h>' (ziT a))3 .  The  orthogonality  of  p  and  r  results  in 

I  XTMtlX  0  ^ 

\  0  ZTMrZ  j  ' 

Hence  Jeffreys’s  prior  is  |J(/3,a)|a  =  \XT M llX\i\ZT MrZ\^ . 

It  is  worth  observing  that  obtaining  the  MLE  for  /3  and  a  under  (6)  is  challenging. 
Customary  iteratively  reweighted  least  squares  algorithms  such  as  Fisher  scoring  will  require 
the  calculation  of  p  and  several  of  its  partial  derivatives  at  each  (0i,Ti).  A  grid  search 
algorithm  would  only  require  calculation  of  p  but  will  be  very  inefficient  in  higher  dimensions. 
Hence,  fitting  the  Bayesian  model,  using  a  sampling-based  approach,  is  no  harder  than 
performing  a  likelihood  analysis.  But  then,  an  important  question  to  ask  is  whether  either  a 
flat  prior  for  (J3,  a)  or  Jefireys’s  prior  using  (7),  both  of  which  are  improper,  in  combination 
with  the  likelihood  in  (6),  results  in  a  proper  posterior  for  (J3,  a).  We  provide  an  answer  in 
the  appendix,  extending  work  of  Ibrahim  and  Laud  (1991)  who  address  this  question  in  the 
case  of  a  GLM  developed  from  (3). 

4  AN  OVERDISPERSED  POISSON  MODEL 

McCullagh  and  Nelder  (1989,  p204)  discuss  a  data  set  where  the  response  variable  is  the 
number  of  damage  incidents  by  waves  to  cargo  ships.  For  each  of  34  ships  the  aggregate 
months  in  service  were  recorded  as  well  as  the  number  of  damage  incidents  over  that  period. 


(8) 
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Explanatory  factors  are  ship  type  having  5  levels  (A,  B,  C,  D,  E),  year  of  construction  having 
4  levels  (CPI,  CP2,  CP3,  CP4)  and  period  of  operation  having  two  levels  (SP1,  SP2).  Since 
the  response  is  a  count,  McCuIlagh  and  Nelder  propose  a  Poisson  regression  presuming  that 
the  expected  number  of  damage  incidents  is  directly  proportional  to  the  aggregate  months 
in  service,  i.e.,  the  total  period  of  risk.  Using  a  canonical  link  the  GLM  sets 

9  =  log  (aggregate  months  service)  +  /30  +  effect  due  to  ship  type 

+  effect  due  to  year  of  construction  +  effect  due  to  service  period  (9) 

The  log( aggregate  months  service)  term  is  called  an  offset,  its  coefficient  fixed  at  1  as  a  result 
of  the  proportionality  assumption. 

The  parameter  estimates  and  associated  confidence  intervals  for  (9)  under  the  standard 
GLM  appears  in  the  first  column  of  Table  1.  For  each  factor,  the  first  level  is  taken  as  a  base¬ 
line  and  its  effect  is  set  to  zero.  McCuIlagh  and  Nelder  incorporate  a  dispersion  parameter 
<t>,  obtaining  an  estimate  (f>  =  1.69,  indicating  overdispersion  relative  to  the  standard  Poisson 
density  which,  intrinsically,  has  <f>  -  1.  Of  course,  in  fitting  such  an  EDM,  the  estimates  of 
the  effects,  of  the  and  of  the  m  are  unaffected  by  the  presence  of  <j>.  Our  model  in  (1) 
incorporates  overdispersion  in  a  much  different  way;  under  (1),  m  =  p(°’°)(^,  n),  a  function 
of  the  dispersion  parameter.  McCuIlagh  and  Nelder  note  that  some  points  axe  not  well  fit 
using  their  EDM. 

We  consider  three  Bayesian  models,  all  fit  using  a  flat  prior  for  the  effects.  Model  1 
fits  the  GLM  in  (9)  as  a  reduced  case  of  (1)  with  r  =  0  (and,  of  course,  <j>  =  1)  result¬ 
ing  in  a  9  parameter  model.  With  regard  to  inference  about  the  mean  structure,  model 
1  is  essentially  equivalent  to  the  model  of  McCuIlagh  and  Nelder.  Model  2  incorporates  a 
constant  dispersion  parameter  r  =  aco  with  the  convex  function,  T(y)  =  (y  +  l)log(y  +  1). 
Finally,  anticipating  that  overdispersion  might  increase  with  exposure,  model  3  sets 
r  =  oto  +  Qu  log(aggregate  months  service),  using  the  same  T{y),  an  11  parameter  mod¬ 
el.  The  Bayesian  fits  for  models  1,  2  and  3  are  produced  in  Table  1  as  well.  Comparing 
the  likelihood  analysis  with  the  Bayesian  analysis  for  model  1  there  is  an  indication  that  the 
likelihood  based  confidence  intervals,  arising  under  asymptotic  theory,  may  be  too  long  and 
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too  symmetric. 


The  deviations  between  the  observed  yr  and  the  posterior  mean  E{fir  \  Y)  are  given  in 
Table  2.  The  abundance  of  small  counts  suggest,  that  the  exact  analysis  under  our  model  will 
be  more  appropriate  that  the  asymptotic  models  of  Efron  (1986)  and  of  Ganio  and  Schafer 
(1992).  The  posterior  densities  under  model  3  for  ao  and  a*  are  presented  in  Figures  la 
and  lb  respectively.  These  figures,  along  with  the  last  column  of  Table  1  provide  evidence 
of  overdispersion  (r  >  0)  and,  in  addition,  evidence  that  ati  >  0,  supporting  the  hypothesis 
that  overdispersion  increases  with  exposure  to  risk.  Also  from  Table  2,  model  3  seems  to 
better  fit  the  most  of  larger  yr,  which  are  associated  with  greater  exposure. 

Fitting  models  2  and  3  requires  calculation  of  p(Q,  r)  which  is  the  sum  of  the  form 

p(0,t)  =  log  £  (y!)_1e*w+rrH  (10) 

y=0 

Calculation  of  Jeffreys’  prior  requires  of  computation  of  sums  similar  to  (10)  as  described 
after  (7).  Under  a  flat  prior,  a  proper  posterior  results  following  the  argument  of  the  appendix 
since,  with  canonical  links,  log  concavity  of  (6)  holds.  We  suspect  a  proper  posterior  arises 
using  Jeffreys’  prior  but  the  analytical  approaches  suggested  in  the  appendix  are  not  easily 
applied. 

Lastly,  with  regard  to  models  1,  2  and  3  we  ask  two  questions.  Are  these  models  adequate? 
Amongst  those  that  are,  which  one  would  we  choose?  The  Bayesian  approach  to  answering 
these  questions  is  based  upon  predictive  distributions.  Under  an  improper  flat  prior  the 
marginal  density  of  the  data  (the  prior  predictive  density)  is  improper  hence  impossible  to 
calibrate.  As  an  alternative  we  adopt  a  cross-validation  approach,  paralleling  widely  used 
classical  regression  strategy.  In  particular,  we  consider  the  proper  densities  f(yr  |  P(r)), 
r  “  1,  2,  ••.,  34,  where  y(r)  denotes  y  with  yT  removed.  We,  in  fact,  condition  on  the 
actual  observations  t/(r)t06i  creating  the  predictive  distribution  for  yr  under  the  model  and 
all  the  data  except  yr.  For  model  determination  we  would  then  compare,  in  some  fashion, 
f(yr  I  V(r),o6j)  with  the  T-tA  observation,  yr,ob$-  Such  cross  validation  is  discussed  in  Gelfand, 
Dey  and  Chang  (1992)  and  in  further  references  provided  there. 
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A  natural  approach  for  model  adequacy  is  to  draw,  for  each  r,  a  sample  from 
f(yr  |  y(r)t0iJ,  and  compare  this  sample  with  yr,oiu-  In  particular  using  this  sample  we  might 
obtain  the  .025  and  .975  quantiles  of  the  f(yr  j  y(r)(<*,)  say  and  yr  and  see  how  many  of 
the  34  yr,ob*  £  [i^.,yr].  Under  each  model  at  least  28  of  the  34  intervals  contained  the  corre¬ 
sponding  yr,obs-  Suppose  instead  we  obtain  the  lower  and  upper  quartiles  of  f(yr  |  y (r),obt) 
and  see  how  many  yr,ob»  fall  in  their  interquartile  ranges.  We  find  11  for  model  1,  18  for 
model  2  and  16  for  model  3.  Under  the  true  model  we  would  expect  half,  i.e.,  17.  Hence  both 
model  2  and  model  3  perform  close  to  expectation  though  all  three  models  seem  adequate. 

A  well  established  tool  for  model  choice  is  the  conditional  predictive  ordinate  (CPO), 
f(yr,abs  |  y(r),o6«)-  A  large  value  implies  agreement  between  the  observation  and  the  model. 
For  comparing  models,  the  ratio  dr  =  (or  perhaps  log  dr)  indicates  support 

by  point  r  for  one  model  versus  the  other  (see  Pettit  and  Young,  1990).  Figure  2  provides  a 
plot  of  logCPO  ratios  for  model  3  vs  model  1  and  for  model  3  vs  model  2.  Model  3  emerges  as 
best.  A  simple  diagnostic  with  a  frequentist  flavor  is  (yr,o 6*  —  F(/ir|y))2/34.  For  model 
1  this  statistic  is  25.37,  for  model  2  it  is  12.09  and  for  model  3  it  is  4.60,  again  supporting 
model  3. 

We  recognize  that  the  above  model  determination  diagnostics  are  informal.  However 
they  are  in  the  spirit  of  widely  used  classical  EDA  approaches  and  do  permit  examination 
of  model  performance  at  the  level  of  the  individual  observation. 


5  BRIEF  REMARKS  ON  THE  SAMPLING  BASED 
ANALYSIS  OF  THE  MODELS 

We  conclude  with  a  brief  discussion  regarding  the  Bayesian  fitting  of  the  models,  computation 
of  the  /( yr  |  y(r))  and  the  sampling  from  them  in  the  present  context.  For  each  model  we  used 
a  Metropolis-within-Gibbs  Markov  chain  Monte  Carlo  algorithm  (Muller,  1994)  to  develop 
samples  from  the  posterior,  beginning  with  multiple  starts  in  the  vicinity  of  the  maximum 
likelihood  estimate.  Evaluation  of  the  likelihood  required  repeated  calculation  of  the  function 
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p(0j,  Tj).  These  samples  provide  the  Bayesian  inference  associated  with  Tables  1  and  2  and 
Figure  1. 


As  for  model  determination,  let  (/3*,  a*),  j  =  1,  m  denote  a  sample  of  size  m  from  a 
particular  posterior.  Since 

f(  ,  s  S  f(y  1  a)f(P,  a)df3da 

J\VrtV(r))  //(yw|Aa)/(fl,a)#&« 

=  It)**. 

_ 1 _ 

/  /(yr  |  /3,  a)-1/(/3,  a  |  y)d(3da  ’ 

a  Monte  Carlo  estimate  of  (/(yr  |  y(r)i04f)  based  upon  (/3*,  aj)  is 

/(*  I  yw^)  =  (m-1  £(/(*  I  /3',  a*))-1)-1.  (11) 

j=i 

(11)  is  used  for  CPO  calculations.  See  Gelfand  and  Dey  (1994)  for  more  general  discussion. 

Samples  from  f(yr  |  y^)  are  drawn  in  two  stages.  Given  samples  (/3*,  cr^),  j  =  1, 
2,  ...,  m,  from  f(j3,at  |  y)  we  convert  these  to  samples  from  /(/ 3,  a  |  y^)  by  resam¬ 
pling  with  weights  qj  =  — ■  .  See  Smith  and  Gelfand  (1992)  in  this  re- 

Zj  jml  W\Vr,«*«  \Pj  i&j 

gaxd.  Since  /( yr  |  y(r))  =  //( yr  \  (3,  a)/(/3,  a  |  y(r))d/3dcr,  if  /3',a'  is  a  draw  from 
/(/3,  a  |  yr)  then  if  ~  /( yr  \  a'),  the  marginal  distribution  of  yr'  is  f(yr  |  y(r)). 

APPENDIX:  INTEGRABILITY  OF  POSTERIORS 
FOR  OGLM’S  UNDER  OBJECTIVE  PRIOR  SPEC¬ 
IFICATIONS 

Ibrahim  and  Laud  (1991)  investigate  the  propriety  of  a  Bayesian  GLM,  i.e.,  the  special  case 
when  r  =  0  in  (6),  under  Jeffreys'  prior  which  becomes  \XT  MgX\* .  They  show  that  if  X 
is  full  column  rank  and  the  likelihood  is  bounded  above  then  a  sufficient  condition  for  the 
posterior  of  /3  to  be  proper  is  that  for  each  yi 

[  e^-^ix'i^de  <  oo  ( A1 : 

Je 
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posterior  of  /3  to  be  proper  is  that  for  each  yi 

f  <  oo 

Je 

where  ©  denotes  the  natural  parameter  space  for  the  canonical  parameter  9. 


Since  I(/3,  at)  is  positive  definite,  |I(/3,  a)|^  <  \XT  MbX\^\Zt  MTZ\* .  Hence,  if,  in  (6) 
L(/3,  0;  y )  is  bounded  above,  (Al)  implies  that  f(j3  |  at,  Y )  is  proper.  Similarly  if  L( 0,  a;  Y ) 
is  bounded  above,  if  7 (r)  =  />(0,  r)  for  re r  and  if  for  each  yi 

tt(7*(r))*<fr.  <OQ  (A2) 

then  /(ce,  |  /3,  y)  is  proper.  Of  course  the  fact  that  these  conditional  distributions  are  proper 
does  not  imply  that  the  joint  distribution  is.  Combining  (6)  and  (7),  we  need  to  establish 
when 


J  J L(j3,a; y)|J(/9, at)\*d(3dct  <  00.  (A3) 

Suppose  we  assume  that  X  and  j?  are  of  full  column  rank  and  that,  L(/3,  at;  Y)  is  bounded 
above.  The  fact  that 

|/(0,a)|*  <  (f[(XTM,X)u)Hil^TMrZ)m)i  <M) 

tel  mss  1 

tel  j—l  n@l  j—l 

enables  us  to  imitate  the  argument  of  Ibrahim  and  Laud  to  conclude  that,  if  for  each  y,-, 


Ir  Jq  e<>Wi'^T(vi)'p(S’T)(p(2’0)(^,  t>(°’2)(0,  t))’  d8dr  <  00 


(A5 ) 


then  (A3)  holds.  We  omit  details.  Bamdorff-Nielsen  (1978)  provides  conditions  for  a 
bounded  likelihood  and  that  these  will  usually  hold  for  densities  of  the  form  (1).  The 
condition  (A5)  is  easier  to  work  with  than  (A3)  since  it  is  a  two,  rather  than  p  +  q,  di¬ 
mensional  integral.  However  since  p(2*°)  and  p(°.a)  are  not  available  explicitly  they  must  be 
approximated  within  (A5)  making  analytic  investigation  difficult. 


Integrability  of  the  posterior  may  also  be  investigated  using  a  very  different  approach. 
Suppose  L(/3,a;Y )  is  log  concave.  Then,  if  the  prior  /(/ 3,  a)  is  bounded,  the  posterior  is 
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the  “lower”  one  has  all  positive  slope  coefficients,  the  “upper”  one  all  negative.  The  lower 
one  bounds  logi  on  a  set  say  Sl,  the  upper  one  on  the  complement,  Su-  But  then  upon 
exponentiating,  in  each  case  we  obtain  a  product  of  exponential  curves.  The  exponentiated 
lower  curve  is  immediately  integrable  over  Si,  the  upper  curve  over  Su  whence  L  is  integrable 
and  hence  the  posterior,  if  /(/ 3,  a)  is  bounded. 


Log  concavity  of  L(J3,Q\Y)  is  discussed  in  Wedderbum  (1976)  and  in  Dellaportas  and 
Smith  (1993).  The  former  is  concerned  with  the  behavior  of  MLE’s,  the  latter  with  simplify¬ 
ing  Monte  Carlo  sampling  of  (3.  What  follows  has  implications  for  OGLM’s  in  either  context. 
In  particular,  for  the  OGLM  defined  in  (6),  log  concavity  can  be  established  by  verifying  a 
simple  nonnegative  definiteness  condition.  Letting  9  =  g(rj),  r  =  h( 7)  consider  tfc  action 

<1(77,7)  =  9(ij)y+T('y)T{y)-p(0(ri),T(~f)).  If-0  >  0,  -0  >  0  and 


'flu3* 

d-tdri1 


& 


>0, 


(6)  is  log  concave.  For  the  canonical  case  9  =  77,  r  =  7,  -0  —  var( y),  0  —  var(T(y)) 
and  the  determinant  becomes  var(y)var(T{y))  -  cov2{y,T{y))  which  is  nonnegative  whence 

log  concavity  holds. 


While  the  flat  prior  is  obviously  bounded,  Jeffreys’s  prior  need  not  be  (consider  the  case 
of  the  univariate  normal  standard  deviation).  Using  the  inequality  in  (A4),  \I(J3,  £*)|3  will 
be  bounded  if  the  (M,),-,-  and  ( Afr)«  are.  In  the  canonical  case  this  reduces  to  boundedness 

of  var(y )  and  var(T(y)). 
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Table  1:  Inference  Summaries  for  Models  1,  2  and  3. 


Edm 

Model  2 

Model  3 

Parameters 

mISEH 

-6.410 

-6.444 

-6.337 

-6.351 

B 

( -6.836,  -5.984) 
-0.540 

( -6.863,  -6.055) 
-0.517 

( -6.414,  -6.226) 
-0.571 

(  -6.467,  -6.171) 
-0.567 

C 

( -0.991,  -0.089) 
-0.690 

(  -0.813,  -0.179) 
-0.608 

(  -0.625,  -0.525) 
-0.611 

( -0.738,  -0.339) 
-1.132 

D 

( -1.533,  0.153) 
-0.080 

( -1.156,  0.094) 
-0.089 

(  -1.049,  -0.275) 
0.290 

(  -2.163,  -0.508) 
0.444 

E 

( -0.813,  0.653) 
0.330 

( -0.616,  0.368) 
0.317 

(  0.276,  0.303) 

0.248 

(  0.194,  1.131) 
0.521 

CP2 

( -0.277,  0.937) 
0.700 

( -0.030,  0.731) 
0.719 

(  0.170,  0.309) 

0.534 

(  0.108,  0.734) 
0.356 

CP3 

(  0.328,  1.072) 
0.820 

(  0.380,  1.081) 

0.822 

(  0.420,  0.631) 

0.640 

(  0.182,  0.480) 
0.568 

CP4 

(  0.391,  1.249) 
0.450 

(  0.531,  1.103) 

0.476 

(  0.635,  0.645) 

0.695 

(  0.177,  0.737) 
-0.764 

SP2 

( -0.138,  1.038) 
0.380 

(  0.055,  0.889) 

0.383 

(  0.532,  0.810) 

0.248 

( -1.575,  -0.498) 
0.515 

Oo 

(  0.086,  0.674) 

(  0.151,  0.600) 

(  0.172,  0.343) 

0.034 

(  0.031,  0.040) 

(  0.219,  0.794) 
0.038 

(  0.024,  0.072) 
0.009 

(  0.001,  0.014) 

n 


ZJ. 


Table  2:  Comparison  of  Fits  for  Models  1,  2  and  3.  (Dev)r  —  yr  ■£(/*»■  |y) 


■ 

II 

Model  1 

(DevV 

Model! 

i?G*r|y) 

Model  3 

£(/vly)  (Dev)L 

0.24 

-0.24 

umm 

sswfm 

0.23 

-0.23 

2 

Hi 

0.15 

-0.15 
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Figure  1:  Posterior  Densities  for  a  and  a  under  Model  3. 
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Figure  2:  Log  CPO  Ratios  for  Model  1,  2  and  3  as  Described  in  Section  4. 
(2a):  Model  3  vs  Model  1;  (2b);  Model  3  vs  Model  2. 
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